Noise Reduction and Audio-Visual Speech Activity Detection
FIELD AND BACKGROUND OF THE INVENTION 

The present invention generally relates to the field of noise reduction based on speech ac- 
tivity recognition, in particular to an audio-visual user interface of a telecommunication 
device running an application that can advantageously be used e.g. for a near-speaker de- 
tection algorithm in an environment where a speaker's voice is interfered by a statistically 
distributed background noise including environmental noise as well as surrounding per- 
sons' voices. 

Discontinuous transmission of speech signals based on speech/pause detection represents a 
valid solution to improve the spectral efficiency of new-generation wireless communication 
systems. In this context, robust voice activity detection algorithms are required, as conven- 
tional solutions according to the state of the art present a high misclassification rate in the 
presence of the background noise typical of mobile environments. 

A voice activity detector (VAD) aims to distinguish between a speech signal and several 
types of acoustic background noise even with low signal-to-noise ratios (SNRs). Therefore, 
in a typical telephone conversation, such a VAD, together with a comfort noise generator 
(CNG), is used to achieve silence compression. In the field of multimedia communications, 
silence compression allows a speech channel to be shared with other types of information, 
thus guaranteeing simultaneous voice and data applications. In cellular radio systems which
are based on the Discontinuous Transmission (DTX) mode, such as GSM, VADs are applied
to reduce co-channel interference and power consumption of the portable equipment. Fur-
thermore, a VAD is vital to reduce the average data bit rate in future generations of digital
cellular networks such as the UMTS, which provide for a variable bit-rate (VBR) speech
coding. Most of the capacity gain is due to the distinction between speech activity and
inactivity. The performance of a speech coding approach which is based on phonetic classi-
fication, however, strongly depends on the classifier, which must be robust to every type of
background noise. As is well known, the performance of a VAD is critical for the overall
speech quality, in particular with low SNRs. In case speech frames are detected as noise,
intelligibility is seriously impaired owing to speech clipping in the conversation. If, on the
other hand, the percentage of noise detected as speech is high, the potential advantages of 
silence compression are not obtained. In the presence of background noise it may be diffi- 
cult to distinguish between speech and silence. Hence, for voice activity detection in wire- 
less environments more efficient algorithms are needed. 

Although the Fuzzy Voice Activity Detector (FVAD) proposed in "Improved VAD G.729
Annex B for Mobile Communications Using Soft Computing" (Contribution ITU-T Study
Group 16, Question 19/16, Washington, September 2-5, 1997) by F. Beritelli, S. Casale
and A. Cavallaro performs better than other solutions presented in literature, it exhibits an
activity increase, above all in the presence of non-stationary noise. The functional scheme
of the FVAD is based on a traditional pattern recognition approach wherein the four differ-
ential parameters used for speech activity/inactivity classification are the full-band energy
difference, the low-band energy difference, the zero-crossing difference, and the spectral
distortion. The matching phase is performed by a set of fuzzy rules obtained automatically
by means of a new hybrid learning tool as described in "FuGeNeSys: Fuzzy Genetic Neural
System for Fuzzy Modeling" by M. Russo (to appear in IEEE Transaction on Fuzzy Sys-
tems). As is [...]

a sharp change between two values. Thus, the Fuzzy VAD returns a continuous output sig- 
nal ranging from 0 (non-activity) to 1 (activity), which does not depend on whether single 
input signals have exceeded a predefined threshold or not, but on an overall evaluation of 
the values they have assumed ("defuzzification process"). The final decision is made by
comparing the output of the fuzzy system, which varies in a range between 0 and 1, with a
fixed threshold experimentally chosen as described in "Voice Control of the Pan-European 
Digital Mobile Radio System" (ICC '89, pp. 1070-1074) by C. B. Southcott et al. 
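
The soft-decision principle described above can be illustrated with a minimal sketch. It is not the FVAD rule base: it assumes a single input (the full-band energy difference against a known noise floor) and replaces the fuzzy rules with a logistic curve. The function names, the `midpoint_db` and `slope_db` parameters and the threshold of 0.5 are hypothetical choices made for illustration only.

```python
import math

def frame_energy_db(frame):
    """Mean power of one frame of samples in dB (small offset avoids log(0))."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)

def soft_activity(frame, noise_floor_db, midpoint_db=6.0, slope_db=1.5):
    """Continuous activity score in [0, 1] from the full-band energy difference.

    The logistic curve is an assumed stand-in for the fuzzy rule base.
    """
    diff = frame_energy_db(frame) - noise_floor_db
    return 1.0 / (1.0 + math.exp(-(diff - midpoint_db) / slope_db))

def is_speech(frame, noise_floor_db, threshold=0.5):
    """Final hard decision: compare the soft score with a fixed threshold."""
    return soft_activity(frame, noise_floor_db) > threshold
```

The score changes gradually around the midpoint instead of jumping between two values, and only the last step applies a fixed threshold.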

Just as voice activity detectors, conventional automatic speech recognition (ASR) systems
also experience difficulties when being operated in noisy environments since the accuracy of
conventional ASR algorithms largely decreases in noisy environments. When a speaker is
talking in a noisy environment including both ambient noise as well as surrounding per-
sons' interfering voices, a microphone picks up not only the speaker's voice but also these
background sounds. Consequently, an audio signal which encompasses the speaker's voice
superimposed by said background sounds is processed. The louder the interfering sounds,
the more the acoustic comprehensibility of the speaker is reduced. To overcome this prob-
lem, noise reduction circuitries are applied that make use of the different frequency regions
of environmental noise and the respective speaker's voice.

A typical noise reduction circuitry for a telephony-based application based on a speech
activity estimation algorithm according to the state of the art that implements a method for
correlating the discrete signal spectrum $S(k \cdot \Delta f)$ of an analog-to-digital-converted
audio signal $s(t)$ with an audio speech activity estimate is shown in Fig. 2a. Said audio
speech activity estimate is obtained by an amplitude detection of the digital audio signal
$s(nT)$. The circuit outputs a noise-reduced audio signal $\hat{s}_i(nT)$, which is calculated by
subjecting the difference of the discrete signal spectrum $S(k \cdot \Delta f)$ and a sampled version
$\tilde{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum $\tilde{\Phi}_{nn}(f)$ of a
statistically distributed background noise $n'(t)$ to an Inverse Fast Fourier Transform (IFFT).
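
The transform, subtract, inverse-transform chain described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit of Fig. 2a: it uses the common magnitude-subtraction variant (the noise estimate is subtracted from the magnitude of each bin and the noisy phase is kept), a plain DFT instead of an FFT, and a noise magnitude spectrum `noise_mag` that is assumed to be given. All function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each time-domain sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def spectral_subtract(frame, noise_mag):
    """Subtract an estimated noise magnitude spectrum from one frame."""
    cleaned = []
    for s_k, n_k in zip(dft(frame), noise_mag):
        mag = max(abs(s_k) - n_k, 0.0)                      # floor at zero
        cleaned.append(cmath.rect(mag, cmath.phase(s_k)))   # keep noisy phase
    return idft(cleaned)
```

Bins whose magnitude lies below the noise estimate are set to zero, while stronger bins are only attenuated.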

BRIEF DESCRIPTION OF THE STATE OF THE ART

The invention described in US 5,313,522 refers to a device for facilitating comprehension
by a hearing-impaired person participating in a telephone conversation, which comprises a
circuitry for converting received audio speech signals into a series of phonemes and an
arrangement for coupling the circuitry to a POTS line. The circuit thereby includes an ar-
rangement which correlates the detected series of phonemes with recorded lip movements
of a speaker and displays these lip movements in subsequent images on a display device,
thereby permitting the hearing-impaired person to carry out a lipreading procedure while
listening to the telephone conversation, which improves the person's level of comprehen-
sion.

The invention disclosed in WO 99/52097 pertains to a communication device and a method
for sensing the movements of a speaker's lips, generating an audio signal corresponding to
detected lip movements of said speaker and transmitting said audio signal, thereby sensing
a level of ambient noise and accordingly controlling the power level of the audio signal to
be transmitted.

OBJECT OF THE UNDERLYING INVENTION 



In view of the state of the art mentioned above, it is the object of the present invention to
enhance the speech/pause detection accuracy of a telephony-based voice activity detection
(VAD) system. In particular, it is the object of the invention to increase the signal-to-inter-
ference ratio (SIR) of a recorded speech signal in crowded environments where a speaker's
voice is severely interfered by ambient noise and/or surrounding persons' voices.

The aforementioned object is achieved by means of the features in the independent claims.
Advantageous features are defined in the subordinate claims.

SUMMARY OF THE INVENTION

The present invention is dedicated to a noise reduction and automatic speech activity rec-
ognition system having an audio-visual user interface, wherein said system is adapted for
running a real-time lip tracking application which combines a visual feature vector
$o_{v,nT}$, which comprises features extracted from a digital video sequence $v(nT)$ showing
a speaker's face by detecting and analyzing e.g. lip movements and/or facial expressions of
said speaker $S_i$, with an audio feature vector $o_{a,t}$ which comprises features extracted
from a recorded analog audio sequence $s(t)$. Said audio sequence $s(t)$ thereby represents
the voice of said speaker $S_i$, interfered by a statistically distributed background noise

$$n'(t) = n(t) + s_{Int}(t), \qquad (1)$$

which includes both environmental noise $n(t)$ and a weighted sum of surrounding persons'
interfering voices

$$s_{Int}(t) = \sum_{j=1}^{N} a_j \cdot s_j(t - T_j) \quad (\text{for } j \neq i) \qquad (2a)$$

with

$$a_j = \frac{1}{4\pi \cdot R_{jM}^2} \ [\mathrm{m}^{-2}] \qquad (2b)$$

in the environment of said speaker $S_i$. Thereby, $N$ denotes the total number of speakers
(inclusive of said speaker $S_i$), $a_j$ is the attenuation factor for the interference signal
$s_j(t)$ of the $j$-th speaker $S_j$ in the environment of the speaker $S_i$, $T_j$ is the delay of
$s_j(t)$, and $R_{jM}$ denotes the distance between the $j$-th speaker $S_j$ and a microphone
recording the audio signal $s(t)$.
By tracking the lip movement of a speaker, visual features are extracted which can then be
analyzed and used for further processing. For this reason, the bimodal perceptual user inter-
face comprises a video camera pointing to the speaker's face for recording a digital video
sequence $v(nT)$ showing lip movements and/or facial expressions of said speaker $S_i$, audio
feature extraction and analyzing means for determining acoustic-phonetic speech charac-
teristics of the speaker's voice and pronunciation based on the recorded audio sequence
$s(t)$, and visual feature extraction and analyzing means for continuously or intermittently
determining the current location of the speaker's face, tracking lip movements and/or facial
expressions of the speaker in subsequent images and determining acoustic-phonetic speech
characteristics of the speaker's voice and pronunciation based on the detected lip move-
ments and/or facial expressions.
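
The background noise model of equations (1) and (2a) can be made concrete with a short sketch on a discrete sample grid. It assumes free-field spherical spreading for the attenuation factor, i.e. $a_j = 1/(4\pi R_{jM}^2)$, integer sample delays, and hypothetical function and parameter names; it is meant as an illustration of the model, not as part of the described system.

```python
import math

def attenuation(distance_m):
    """Assumed attenuation factor a_j = 1 / (4 * pi * R_jM**2)."""
    return 1.0 / (4.0 * math.pi * distance_m ** 2)

def background_noise(env_noise, speakers, i):
    """n'(t) = n(t) + sum over j != i of a_j * s_j(t - T_j).

    env_noise: samples of the environmental noise n(t).
    speakers:  one (samples, delay_in_samples, distance_m) tuple per speaker j.
    i:         index of the near speaker S_i, excluded from the sum.
    """
    out = list(env_noise)
    for j, (samples, delay, distance) in enumerate(speakers):
        if j == i:
            continue  # the near speaker's own voice is not interference
        a_j = attenuation(distance)
        for t in range(len(out)):
            src = t - delay
            if 0 <= src < len(samples):
                out[t] += a_j * samples[src]
    return out
```

With this model an interfering speaker at twice the distance contributes a quarter of the amplitude, which is why a distant voice is easier to separate from the near speaker than a neighbour's.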



According to the invention, the aforementioned extracted and analyzed visual features are
fed to a noise reduction circuit that is needed to increase the signal-to-interference ratio
(SIR) of the recorded audio signal $s(t)$. Said noise reduction circuit is specially adapted to
perform a near-speaker detection by separating the speaker's voice from said background
noise $n'(t)$ based on the derived acoustic-phonetic speech characteristics

$$o_{av,t} := \left[ o_{a,t}^T, \; o_{v,nT}^T \right]^T \qquad (3)$$

It outputs a speech activity indication signal ($\hat{s}_i(nT)$) which is obtained by a combination
of speech activity estimates supplied by said audio feature extraction and analyzing means
as well as said visual feature extraction and analyzing means.
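
The stacking of audio and visual features into one audio-visual vector, and the combination of two speech activity estimates, might look as follows in a minimal sketch. The text does not specify how the estimates are combined; the convex weighting used here, the weight `w_audio` and both function names are assumptions made for illustration.

```python
def audio_visual_vector(o_a, o_v):
    """Stack the audio features and the visual features into one vector."""
    return list(o_a) + list(o_v)

def combined_activity(audio_estimate, visual_estimate, w_audio=0.5):
    """Blend two speech activity estimates in [0, 1] with a convex weight.

    The weighting rule is hypothetical; any monotone combination would do.
    """
    return w_audio * audio_estimate + (1.0 - w_audio) * visual_estimate
```

A larger `w_audio` trusts the microphone more, which may suit quiet rooms, while a smaller value leans on the camera when the audio channel is noisy.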

BRIEF DESCRIPTION OF THE DRAWINGS 

Advantageous features, aspects, and useful embodiments of the invention will become evi-
dent from the following description, the appended claims, and the accompanying drawings.

Fig. 1 shows a noise reduction and speech activity recognition system having an audio-
visual user interface, said system being specially adapted for running a real-time
lip tracking application which combines visual features $o_{v,nT}$ extracted from a
digital video sequence $v(nT)$ showing the face of a speaker $S_i$ by detecting and
analyzing the speaker's lip movements and/or facial expressions with audio fea-
tures $o_{a,t}$ extracted from an analog audio sequence $s(t)$ representing the voice of
said speaker $S_i$ interfered by a statistically distributed background noise $n'(t)$,

Fig. 2a is a block diagram showing a conventional noise reduction and speech activity
recognition system for a telephony-based application based on an audio speech
activity estimation according to the state of the art,

Fig. 2b shows an example of a camera-enhanced noise reduction and speech activity
recognition system for a telephony-based application that implements an audio-
visual speech activity estimation algorithm according to one embodiment of the
present invention,

Fig. 2c shows an example of a camera-enhanced noise reduction and speech activity
recognition system for a telephony-based application that implements an audio-
visual speech activity estimation algorithm according to a further embodiment of
the present invention,

Fig. 3a shows a flow chart illustrating a near-end speaker detection method reducing the
noise level of a detected analog audio sequence according to the embodiment
depicted in Fig. 1 of the present invention,

Fig. 3b shows a flow chart illustrating a near-end speaker detection method according to
the embodiment depicted in Fig. 2b of the present invention, and

Fig. 3c shows a flow chart illustrating a near-end speaker detection method according to
the embodiment depicted in Fig. 2c of the present invention.

DETAILED DESCRIPTION OF THE UNDERLYING INVENTION 

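In the following, different embodiments of the present invention as depicted in Figs. 1,
2, and 3a-c shall be explained in detail. The meaning of the symbols designated with ref-
erence numerals and signs in Figs. 1 to 3c can be taken from an annexed table.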

According to a first embodiment of the invention as depicted in Fig. 1, said noise reduction
and speech activity recognition system 100 comprises a noise reduction circuit 106 which
is specially adapted to reduce statistically distributed background noise $n'(t)$ received by a
microphone 101a and to perform a near-speaker detection by separating the speaker's voice
from said background noise $n'(t)$ as well as a multi-channel acoustic echo cancellation unit
108 being [...] algorithm based on acoustic-phonetic speech characteristics derived with the
aid of the aforementioned audio and visual feature extraction and analyzing means 104a+b
[...] of a speaker's mouth as an estimate of the acoustic energy of articulated vowels or diph-
thongs, respectively, rapid movement of the speaker's lips as a hint to labial or labio-dental
consonants (e.g. plosive, fricative or affricative phonemes - voiced or unvoiced, respec-
tively), and other statistically detected phonetic characteristics of an association between
position and movement of the lips and the voice and pronunciation of a speaker $S_i$.

The aforementioned noise reduction circuit 106 comprises digital signal processing means
106a for calculating a discrete signal spectrum $S(k \cdot \Delta f)$ that corresponds to an analog-to-
digital-converted version $s(nT)$ of the recorded audio sequence $s(t)$ by performing a Fast
Fourier Transform (FFT), audio feature extraction and analyzing means 106b (e.g. an am-
plitude detector) for detecting acoustic-phonetic speech characteristics of a speaker's voice
and pronunciation based on the recorded audio sequence $s(t)$, means 106c for estimating
the noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise
$n'(t)$ based on the result of the speaker detection procedure performed by said audio feature
extraction and analyzing means 106b, a subtracting element 106d for subtracting a discre-
tized version $\tilde{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum
$\tilde{\Phi}_{nn}(f)$ from the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted
audio sequence $s(nT)$, and digital signal processing means 106e for calculating the corre-
sponding discrete time-domain signal $\hat{s}_i(nT)$ of the obtained difference signal by per-
forming an Inverse Fast Fourier Transform (IFFT).

The depicted noise reduction and speech activity recognition system 100 comprises audio
feature extraction and analyzing means 106b which are used for determining acoustic-pho-
netic speech characteristics of the speaker's voice and pronunciation ($o_{a,t}$) based on the
recorded audio sequence $s(t)$ and visual feature extraction and analyzing means 104a+b for
determining the current location of the speaker's face at a data rate of 1 frame/s, tracking
lip movements and facial expressions of said speaker $S_i$ at a data rate of 15 frames/s and
determining acoustic-phonetic speech characteristics of the speaker's voice and pronuncia-
tion based on detected lip movements and/or facial expressions ($o_{v,nT}$).

As depicted in Fig. 1, said noise reduction system 200b/c can advantageously be used for a 
video-telephony based application in a telecommunication system running on a video-en- 
abled phone 102 which is equipped with a built-in video camera 101b' pointing at the face 
of a speaker $S_i$ participating in a video telephony session.

Fig. 2b shows an example of a slow camera-enhanced noise reduction and speech activity
recognition system 200b for a telephony-based application which implements an audio-
visual speech activity estimation algorithm according to one embodiment of the present
invention. Thereby, an audio speech activity estimate taken from an audio feature vector
$o_{a,t}$ supplied by said audio feature extraction and analyzing means 106b is correlated with a
further speech activity estimate that is obtained by calculating the difference of the discrete
signal spectrum $S(k \cdot \Delta f)$ and a sampled version $\tilde{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise
power density spectrum $\tilde{\Phi}_{nn}(f)$ of the statistically distributed background noise $n'(t)$.
Said audio speech activity estimate is obtained by an amplitude detection of the band-pass-
filtered discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio signal
$s(t)$.
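
An amplitude detection on the band-pass-filtered spectrum could be sketched as below. The sketch assumes that the magnitude spectrum is already available as a list with bin spacing `delta_f`, uses the conventional telephone band of 300 to 3400 Hz as default cut-off frequencies, and returns a binary estimate. The function names, the default band and the threshold are assumptions, not values taken from the text.

```python
def band_amplitude(spectrum_mag, delta_f, f_low, f_high):
    """Mean magnitude of the bins k with f_low <= k * delta_f <= f_high."""
    band = [m for k, m in enumerate(spectrum_mag)
            if f_low <= k * delta_f <= f_high]
    return sum(band) / len(band) if band else 0.0

def audio_activity(spectrum_mag, delta_f, f_low=300.0, f_high=3400.0,
                   threshold=1.0):
    """Return 1.0 if the band-limited amplitude exceeds the threshold, else 0.0."""
    level = band_amplitude(spectrum_mag, delta_f, f_low, f_high)
    return 1.0 if level > threshold else 0.0
```

Restricting the detector to the speech band keeps low-frequency rumble and high-frequency hiss from triggering a false activity estimate.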



Similar to the embodiment depicted in Fig. 1, the noise reduction and speech activity rec-
ognition system 200b depicted in Fig. 2b comprises an audio feature extraction and ana-
lyzing means 106b (e.g. an amplitude detector) which is used for determining acoustic-pho-
netic speech characteristics of the speaker's voice and pronunciation ($o_{a,t}$) based on the
recorded audio sequence $s(t)$ and visual feature extraction and analyzing means 104' and
104'' for determining the current location of the speaker's face at a data rate of 1 frame/s,
tracking lip movements and facial expressions of said speaker $S_i$ at a data rate of 15
frames/s and determining acoustic-phonetic speech characteristics of the speaker's voice
and pronunciation based on detected lip movements and/or facial expressions ($o_{v,nT}$).
Thereby, said audio feature extraction and analyzing means 106b can simply be realized as
an amplitude detector.

Aside from the components 106a-e described above with reference to Fig. 1, the noise re-
duction circuit 106 depicted in Fig. 2b comprises a delay element 204, which provides a
delayed version of the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted
audio signal $s(t)$, a first multiplier element 107a, which is used for correlating (S9) the dis-
crete signal spectrum $S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT-\tau)$ of the analog-to-digital-
converted audio signal $s(nT)$ with a visual speech activity estimate taken from a visual feature
vector $o_{v,t}$ supplied by the visual feature extraction and analyzing means 104a+b and/or
104'+104'', thus yielding a further estimate $\tilde{S}_i'(f)$ for updating the estimate $\tilde{S}_i(f)$ for
the frequency spectrum $S_i(f)$ corresponding to the signal $s_i(t)$ that represents said speaker's
voice as well as a further estimate $\tilde{\Phi}_{nn}'(f)$ for updating the estimate $\tilde{\Phi}_{nn}(f)$ for the
noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise $n'(t)$,
and a second multiplier element 107, which is used for correlating (S8a) the discrete signal
spectrum $S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT-\tau)$ of the analog-to-digital-converted audio
signal $s(nT)$ with an audio speech activity estimate obtained by an amplitude detection (S8b)
of the band-pass-filtered discrete signal spectrum $S(k \cdot \Delta f)$, thus yielding an estimate
$\tilde{S}_i(f)$ for the frequency spectrum $S_i(f)$ which corresponds to the signal $s_i(t)$ that repre-
sents said speaker's voice and an estimate $\tilde{\Phi}_{nn}(f)$ for the noise power density spectrum
$\Phi_{nn}(f)$ of said background noise $n'(t)$. A sample-and-hold (S&H) element 106d' provides
a sampled version $\tilde{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum
$\tilde{\Phi}_{nn}(f)$. The noise reduction circuit 106 further comprises a band-pass filter with adjust-
able cut-off frequencies, which is used for filtering the discrete signal spectrum $S(k \cdot \Delta f)$ of
the analog-to-digital-converted audio signal $s(t)$. The cut-off frequencies can be adjusted de-
pendent on the bandwidth of the estimated speech signal spectrum $\tilde{S}_i(f)$. A switch 106f is
provided for selectively switching between a first and a second mode for receiving said
speech signal $s_i(t)$ with and without using the proposed audio-visual speech recognition ap-
proach providing a noise-reduced speech signal $\hat{s}_i(t)$, respectively. According to a further
aspect of the present invention, means are provided for switching said microphone 101a off
when the actual level of the speech activity indication signal $\hat{s}_i(nT)$ falls below a prede-
fined threshold value (not shown).
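
One possible reading of the multiplier elements is that the spectrum is weighted by a speech activity estimate, and that the noise estimate is refreshed only while speech is unlikely. The sketch below follows that reading. The recursive averaging rule, the smoothing constant `alpha` and the function names are assumptions and are not taken from the description of Fig. 2b.

```python
def gate_spectrum(spectrum_mag, activity):
    """Multiplier element: weight each bin by an activity estimate in [0, 1]."""
    return [activity * m for m in spectrum_mag]

def update_noise_estimate(noise_psd, spectrum_mag, activity, alpha=0.9):
    """Recursively average |S|^2 into the noise estimate while speech is unlikely.

    With activity == 1.0 the estimate is left unchanged; with activity == 0.0
    it moves towards the current power spectrum at rate (1 - alpha).
    """
    weight = (1.0 - alpha) * (1.0 - activity)
    return [(1.0 - weight) * p + weight * m * m
            for p, m in zip(noise_psd, spectrum_mag)]
```

Calling `update_noise_estimate` once per frame with the visual estimate as `activity` would keep the noise spectrum from absorbing the speaker's own voice.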



An example of a fast camera-enhanced noise reduction and speech activity recognition
system 200c for a telephony-based application which implements an audio-visual speech
activity estimation algorithm according to a further embodiment of the present invention is
depicted in Fig. 2c. The circuitry correlates a discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio signal $s(t)$ with a delayed version of an audio-visual speech
activity estimate and a further speech activity estimate obtained by calculating the difference
spectrum of the discrete signal spectrum $S(k \cdot \Delta f)$ and a sampled version
$\tilde{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum $\tilde{\Phi}_{nn}(f)$. The afore-
mentioned audio-visual speech activity estimate is taken from an audio-visual feature vector
$o_{av,t}$ obtained by combining an audio feature vector $o_{a,t}$ supplied by said audio feature
extraction and analyzing means [...]

Aside from the components described above with reference to Fig. 1, the noise reduction
circuit 106 depicted in Fig. 2c comprises [...] for adding (S11a) an audio speech activity
estimate supplied from an audio feature extraction and analyzing means 106b (e.g. an ampli-
tude detector) for determining acoustic-phonetic speech characteristics of the speaker's voice
and pronunciation based on the recorded audio sequence $s(t)$ to a visual speech activity
estimate supplied from visual feature extraction and analyzing means 104' and 104'' for
determining the current location of the speaker's face at a data rate of 1 frame/s, tracking lip
movements and facial expressions of said speaker $S_i$ at a data rate of 15 frames/s and deter-
mining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation
based on [...]. The noise reduction circuit 106 further comprises a multiplier element 107',
which is used for correlating (S11b) the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-
digital-converted audio signal $s(t)$ with an audio-visual speech activity estimate, obtained by
combining an audio feature vector $o_{a,t}$ supplied by said audio feature extraction and analyz-
ing [...], thereby yielding an estimate $\tilde{S}_i(f)$ for the frequency spectrum $S_i(f)$ which cor-
responds to the signal $s_i(t)$ that represents the speaker's voice and an estimate $\tilde{\Phi}_{nn}(f)$
for the noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background
noise $n'(t)$. A sample-and-hold (S&H) element 106d' provides a sampled version
$\tilde{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum $\tilde{\Phi}_{nn}(f)$. The noise
reduction circuit 106 further comprises a band-pass filter with adjustable cut-off frequencies,
which is used for filtering the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-
converted audio signal $s(t)$. Said cut-off frequencies can be adjusted dependent on the band-
width of the estimated speech signal spectrum $\tilde{S}_i(f)$. A switch 106f is provided for selec-
tively switching between a first and a second mode for receiving said speech signal $s_i(t)$ with
and without using the proposed audio-visual speech recognition approach providing a noise-
reduced speech signal $\hat{s}_i(t)$, respectively. According to a further aspect of the present
invention, said noise reduction system 200c comprises means (SW) for switching said micro-
phone 101a off when the actual level of the speech activity indication signal $\hat{s}_i(nT)$ falls
below a predefined threshold value (not shown).
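
The adding step (S11a) and the microphone switch can be illustrated with a small sketch. Clipping the sum to the range from 0 to 1 and the default threshold of 0.2 are assumptions made so that the example is self-contained; the description itself leaves both the scaling and the threshold value open. The function names are hypothetical.

```python
def audio_visual_activity(audio_estimate, visual_estimate):
    """S11a: add the two estimates and clip the sum to [0, 1] (clipping assumed)."""
    return min(1.0, max(0.0, audio_estimate + visual_estimate))

def microphone_enabled(indication_level, threshold=0.2):
    """Keep the microphone on only while the indication level reaches the threshold."""
    return indication_level >= threshold
```

Because the two estimates are added before any spectrum is touched, this variant needs only one multiplier, which is presumably why the text calls it the fast embodiment.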

A still further embodiment of the present invention is directed to a near-end speaker detec-
tion method as shown in the flow chart depicted in Fig. 3a. Said method reduces the noise
level of a recorded analog audio sequence $s(t)$ being interfered by a statistically distributed
background noise $n'(t)$, said audio sequence representing the voice of a speaker $S_i$. After
having subjected (S1) the analog audio sequence $s(t)$ to an analog-to-digital conversion, the
corresponding discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio
sequence $s(nT)$ is calculated (S2) by performing a Fast Fourier Transform (FFT) and the
voice of said speaker $S_i$ is detected (S3) from said signal spectrum $S(k \cdot \Delta f)$ by analyzing
visual features extracted from a video sequence $v(nT)$ which is recorded simultaneously with
the analog audio sequence $s(t)$ and tracks the current location of the speaker's face, lip
movements and/or facial expressions of the speaker $S_i$ in subsequent images. Next, the noise
power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise $n'(t)$ is
estimated (S4) based on the result of the speaker detection step (S3), whereupon a sampled
version $\tilde{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum $\tilde{\Phi}_{nn}(f)$ is
subtracted (S5) from the discrete spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio
sequence $s(nT)$. Finally, the corresponding discrete time-domain signal $\hat{s}_i(nT)$ of the ob-
tained difference signal, which represents a discrete version of the recognized speech sig-
nal, is calculated (S6) by performing an Inverse Fast Fourier Transform (IFFT).
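
Steps S2, S5 and S6 can be written compactly as follows. The frame length $K$ and the difference spectrum $D$ are symbols introduced here for illustration only; they do not appear in the original text.

```latex
\begin{align}
S(k \cdot \Delta f) &= \sum_{n=0}^{K-1} s(nT)\, e^{-j 2\pi k n / K},
  \qquad \Delta f = \frac{1}{K\,T} && \text{(S2)} \\
D(k \cdot \Delta f) &= S(k \cdot \Delta f) - \tilde{\Phi}_{nn}(k \cdot \Delta f) && \text{(S5)} \\
\hat{s}_i(nT) &= \frac{1}{K} \sum_{k=0}^{K-1} D(k \cdot \Delta f)\, e^{+j 2\pi k n / K} && \text{(S6)}
\end{align}
```

Steps S3 and S4 enter only through $\tilde{\Phi}_{nn}$: the better the visual detection separates speech frames from pause frames, the closer the subtracted spectrum is to the actual noise.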

Optionally, a multi-channel acoustic echo cancellation algorithm which models echo path
impulse responses by means of adaptive finite impulse response (FIR) filters and subtracts
echo signals from the analog audio sequence $s(t)$ can be conducted (S7) based on acoustic-
phonetic speech characteristics derived by an algorithm for extracting visual features from
a video sequence tracking the location of a speaker's face, lip movements and/or facial ex-
pressions of the speaker $S_i$ in subsequent images. Said multi-channel acoustic echo cancel-
lation algorithm thereby performs a double-talk detection procedure.

According to a further aspect of the invention, a learning procedure is applied which en-
hances the step of detecting (S3) the voice of said speaker $S_i$ from the discrete signal spec-
trum $S(k \cdot \Delta f)$ of the analog-to-digital-converted version $s(nT)$ of an analog audio sequence
$s(t)$ by analyzing visual features extracted from a video sequence which is recorded simul-
taneously with the analog audio sequence $s(t)$ and tracks the current location of the
speaker's face, lip movements and/or facial expressions of the speaker $S_i$ in subsequent im-
ages.
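
The optional echo cancellation step (S7) relies on adaptive FIR filters that model the echo path. A minimal single-channel sketch is given below. It uses the normalized least-mean-squares (NLMS) update, which is a common choice but is not named in the text, and it freezes adaptation whenever a per-sample flag `near_end_active` (for example derived from the visual speech activity estimate) signals double talk. All names and parameter values are assumptions.

```python
def nlms_echo_cancel(far_end, mic, taps=4, mu=0.5, eps=1e-8,
                     near_end_active=None):
    """Subtract an adaptively estimated echo of far_end from mic.

    Returns e[n] = mic[n] - w . x[n], where x[n] holds the last `taps`
    far-end samples. Weights are frozen while near_end_active[n] is true.
    """
    w = [0.0] * taps
    x = [0.0] * taps
    out = []
    for n, (f, d) in enumerate(zip(far_end, mic)):
        x = [f] + x[:-1]                              # shift in newest sample
        y = sum(wi * xi for wi, xi in zip(w, x))      # echo estimate
        e = d - y                                     # echo-reduced sample
        frozen = near_end_active is not None and near_end_active[n]
        if not frozen:
            norm = sum(xi * xi for xi in x) + eps
            w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out
```

Freezing the weights during double talk keeps the near speaker's voice from being mistaken for echo and cancelled; a multi-channel version would run one such filter per loudspeaker channel.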

In one embodiment of the present invention, which is illustrated in the flow charts depicted
in Figs. 3a+b, a near-end speaker detection method is proposed that is characterized by the
step of correlating (S8a) the discrete signal spectrum $S_\tau(k \cdot \Delta f)$ of a delayed version
$s(nT-\tau)$ of the analog-to-digital-converted audio signal $s(nT)$ with an audio speech activity
estimate obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal
spectrum $S(k \cdot \Delta f)$, thereby yielding an estimate $\tilde{S}_i(f)$ for the frequency spectrum
$S_i(f)$ which corresponds to the signal $s_i(t)$ representing said speaker's voice and an estimate
$\tilde{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of said background noise
$n'(t)$. Moreover, the discrete signal spectrum $S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT-\tau)$
of the analog-to-digital-converted audio signal $s(nT)$ is correlated (S9) with a visual speech
activity estimate taken from a visual feature vector $o_{v,t}$ which is supplied by the visual
feature extraction and analyzing means 104a+b and/or 104'+104'', thus yielding a further
estimate $\tilde{S}_i'(f)$ for updating the estimate $\tilde{S}_i(f)$ for the frequency spectrum $S_i(f)$
which corresponds to the signal $s_i(t)$ representing the speaker's voice as well as a further
estimate $\tilde{\Phi}_{nn}'(f)$ that is used for updating the estimate $\tilde{\Phi}_{nn}(f)$ for the noise power
density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise $n'(t)$. The noise
reduction circuit 106 thereby provides a band-pass filter 204 for filtering the discrete signal
spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio signal $s(t)$, wherein the cut-off
frequencies of said band-pass filter 204 are adjusted (S10) dependent on the bandwidth of the
estimated speech signal spectrum $\tilde{S}_i(f)$.

According to a further embodiment of the present invention as shown in the flow charts
depicted in Figs. 3a+c, a near-end speaker detection method is proposed which is character-
ized by the step of adding (S11a) an audio speech activity estimate obtained by an amplitude
detection of the band-pass-filtered discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-
digital-converted audio signal $s(t)$ to a visual speech activity estimate taken from a visual
feature vector $o_{v,t}$ supplied by said visual feature extraction and analyzing means 104a+b
and/or 104'+104'', thereby yielding an audio-visual speech activity estimate. According to
this embodiment, the discrete signal spectrum $S(k \cdot \Delta f)$ is correlated (S11b) with the audio-
visual speech activity estimate, thus yielding an estimate $\tilde{S}_i(f)$ for the frequency spectrum
$S_i(f)$ corresponding to the signal $s_i(t)$ that represents said speaker's voice as well as an
estimate $\tilde{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of the statistically dis-
tributed background noise $n'(t)$. The cut-off frequencies of the band-pass filter 204 that is
used for filtering the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted
audio signal $s(t)$ are adjusted (S11c) dependent on the bandwidth of the estimated speech
signal spectrum $\tilde{S}_i(f)$.
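
How the cut-off frequencies follow the bandwidth of the estimated speech spectrum (steps S10 and S11c) is not spelled out in the text. One plausible rule, sketched below as an assumption, is to choose the band that keeps a given share of the energy of the estimated spectrum and to trim an equal share from each end. The `coverage` parameter and the function name are hypothetical.

```python
def cutoff_frequencies(speech_mag, delta_f, coverage=0.9):
    """Return (f_low, f_high) of a band holding about `coverage` of the energy.

    An equal energy share of (1 - coverage) / 2 is trimmed from each end of
    the estimated speech magnitude spectrum `speech_mag` (bin spacing delta_f).
    """
    energy = [m * m for m in speech_mag]
    total = sum(energy)
    if total == 0.0:
        return 0.0, 0.0
    tail = (1.0 - coverage) / 2.0 * total
    low, cum = 0, 0.0
    for k, e in enumerate(energy):                 # scan upwards from bin 0
        cum += e
        if cum > tail:
            low = k
            break
    high, cum = len(energy) - 1, 0.0
    for k in range(len(energy) - 1, -1, -1):       # scan downwards from the top
        cum += energy[k]
        if cum > tail:
            high = k
            break
    return low * delta_f, high * delta_f
```

Re-running this rule on every updated speech estimate lets the filter narrow for a low-pitched voice and widen again when a speaker with more high-frequency content takes over.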

Finally, the present invention also pertains to the use of a noise reduction system 200b/c
and a corresponding near-end speaker detection method as described above for a video-te-
lephony based application (e.g. a video conference) in a telecommunication system running
on a video-enabled phone which has a built-in video camera 101b pointing at the face of a
speaker participating in a video telephony session [...] where a number of persons are sitting
in one room equipped with many cameras and microphones such that a speaker's voice
interferes with the voices of the other persons

Sony Ericsson
P27696EP/P1



No. 



100 



101a 



Table: Depicted Features and Their Cnrr-sp ^ing Rgjereace Sk 



101a' 



101b 



no,se reduction and speech actmty recogmtion system having an audio^aluleT^ 
face, said system being specially adapted for running a real-time lip tracking application 
winch combines visual features o v , nT extracted from a digital video sequence v(„r) 
shown* the face of a speaker S, by detecting and analyzing the speaker's lip movements 
and/or faoal expressions with audio features <w extracted from an analog audio se 
quence s(t) representing the voice of said Speaker Si interfere* by a statistically distrib 
uted background noise „>(,), wherein said audio sequence j® includes - aside from the 
signal representing the voice of said speaker S t - both environmental noise „(,) and a 
we lg hted sum Z, «, Sj< t-Tj) 0 * 0 of surrounding persons' interfering voices in the envi- 
ronment of said speaker S$ 



mtcrophone, used for recordmg an analog aud .o sequence presenting the voice of a 
speaker S, interfered by a statistically distributed background noise „■(,), which includes 
both environmental noise „(,) and a weighted sun, Zj ajM'-W (with, « 0 of surround- 
ing persons- interfering voices in the environment of said speaker 51 



anatog-to-digital converter (ADC), used for co^etlmg .be analog audio sequence W> 
recorded by said microphone 101a into the digital domain 



video camera pointing to the speaker's face for recording a video sequence showing lip
movements and/or facial expressions of said speaker $S_i$




visual front end of an automatic audio-visual speech recognition system 100 using a
bimodal approach to speech recognition and near-speaker detection by incorporating a
real-time lip tracking algorithm for deriving additional visual features from lip
movements and/or facial expressions of a speaker $S_i$ whose voice is interfered by a
statistically distributed background noise $n'(t)$, the visual front end 104 comprising
visual feature extraction and analyzing means for continuously or intermittently determining
the current location of the speaker's face, tracking lip movements and/or facial expressions
of the speaker $S_i$ in subsequent images and determining acoustic-phonetic speech
characteristics of the speaker's voice and pronunciation based on detected lip movements
and/or facial expressions



104'



104"



104a



Technical Feature (System Component or Procedure Step)



visual feature extraction module for continuously tracking lip movements and/or facial
expressions of the speaker $S_i$ and determining acoustic-phonetic speech characteristics
of the speaker's voice based on detected lip movements and/or facial expressions



visual speech activity detection module for analyzing the acoustic-phonetic speech char- 
acteristics and detecting speech activity of a speaker based on said analysis 



visual feature extraction means for continuously or intermittently determining the current
location of the speaker's face recorded by a video camera 101b at a rate of 1 frame/s



visual feature extraction and analyzing means for continuously tracking lip movements
and/or facial expressions of the speaker $S_i$ and determining acoustic-phonetic speech
characteristics of said speaker's voice based on detected lip movements and/or facial
expressions at a rate of 15 frames/s



106a 



noise reduction circuit being specially adapted to reduce statistically distributed
background noise $n'(t)$ received by said microphone 101a and perform a near-speaker
detection by separating the speaker's voice from said background noise $n'(t)$ based on a
combination of the speech characteristics which are derived by said audio and visual feature
extraction and analyzing means 104a+b and 106b, respectively



digital signal processing means for calculating the discrete signal spectrum
$S(k \cdot \Delta f)$ that corresponds to an analog-to-digital-converted version $s(nT)$ of
the recorded audio sequence $s(t)$ by performing a Fast Fourier Transform (FFT)



audio feature extraction and analyzing means (e.g. an amplitude detector) for detecting
acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based
on the recorded audio sequence $s(t)$



means for estimating the noise power density spectrum $\Phi_{nn}(f)$ of the statistically
distributed background noise $n'(t)$ based on the result of the speaker detection procedure
performed by said audio-visual feature extraction and analyzing means 104b, 106b, 104'
and/or 104"



106d



106d'



means for estimating the signal spectrum $S_i(f)$ of the recorded speech signal $s_i(t)$
based on the result of the speaker detection procedure performed by said audio-visual feature
extraction and analyzing means 104b, 106b, 104' and/or 104"

subtracting element for subtracting a discretized version $\hat{\Phi}_{nn}(k \cdot \Delta f)$
of the estimated noise power density spectrum $\hat{\Phi}_{nn}(f)$ from the discrete signal
spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio sequence $s(nT)$



sample-and-hold (S&H) element providing a sampled version $\hat{\Phi}_{nn}(k \cdot \Delta f)$
of the estimated noise power density spectrum $\hat{\Phi}_{nn}(f)$



106e 



106f 



107 



digital signal processing means for calculating the corresponding discrete time-domain
signal $\hat{s}_i(nT)$ of the obtained difference signal by performing an Inverse Fast
Fourier Transform (IFFT)

switch for selectively switching between a first and a second mode for receiving said
speech signal with and without using the proposed audio-visual speech recognition
approach providing a noise-reduced speech signal $\hat{s}_i(t)$, respectively



multiplier element, used for correlating the discrete signal spectrum $S(k \cdot \Delta f)$
of the analog-to-digital-converted audio signal $s(t)$ with an audio speech activity estimate
which is obtained by an amplitude detection of the digital audio signal $s(nT)$



multiplier element, used for correlating (S11b) the discrete signal spectrum
$S(k \cdot \Delta f)$ of the analog-to-digital-converted audio signal $s(t)$ with an
audio-visual speech activity estimate, obtained by combining an audio feature vector
$\underline{o}_{a,t}$ supplied by said audio feature extraction and analyzing means 106b with
a visual feature vector $\underline{o}_{v,t}$ supplied by said visual speech activity
detection module 104", thereby yielding an estimate $\hat{S}_i(f)$ for the frequency spectrum
$S_i(f)$ corresponding to the signal $s_i(t)$ which represents said speaker's voice and an
estimate $\hat{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of the
statistically distributed background noise $n'(t)$



107a 



multiplier element, used for correlating (S9) the discrete signal spectrum
$S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT-\tau)$ of the analog-to-digital-converted
audio signal $s(nT)$ with a visual speech activity estimate taken from a visual feature vector
$\underline{o}_{v,t}$ supplied by the visual feature extraction and analyzing means 104a+b
and/or 104'+104", thereby yielding a further estimate $\hat{S}_i'(f)$ for updating the
estimate $\hat{S}_i(f)$ for the frequency spectrum $S_i(f)$ corresponding to the signal
$s_i(t)$ which represents said speaker's voice as well as a further estimate
$\hat{\Phi}_{nn}'(f)$ for updating the estimate $\hat{\Phi}_{nn}(f)$ for the noise power
density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise $n'(t)$



1071 



multiplier element, used for correlating (S8a) the discrete signal spectrum
$S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT-\tau)$ of the analog-to-digital-converted
audio signal $s(nT)$ with an audio speech activity estimate obtained by an amplitude detection
(S8b) of the band-pass-filtered discrete signal spectrum $S(k \cdot \Delta f)$, thereby
yielding an estimate $\hat{S}_i(f)$ for the frequency spectrum $S_i(f)$ corresponding to the
signal $s_i(t)$ which represents said speaker's voice as well as an estimate
$\hat{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of the statistically
distributed background noise $n'(t)$



summation element, used for adding (S11a) the audio speech activity estimate to the
visual speech activity estimate, thereby yielding an audio-visual speech activity estimate



multi-channel acoustic echo cancellation unit being specially adapted to perform a
near-end speaker detection and double-talk detection algorithm based on acoustic-phonetic
speech characteristics derived by said audio and visual feature extraction and analyzing
means 104a+b and 106b, respectively



10! 



means for near-end talk and/or double-talk detection, integrated in the multi-channel
acoustic echo cancellation unit 108
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2C 



block diagram showing a conventional noise reduction and speech activity recognition
system for a telephony-based application based on an audio speech activity estimation
according to the state of the art, wherein the discrete signal spectrum $S(k \cdot \Delta f)$
of the analog-to-digital-converted audio signal $s(t)$ is correlated with an audio speech
activity estimate which is obtained by an amplitude detection of the digital audio signal
$s(nT)$



block diagram showing an example of a slow camera-enhanced noise reduction and
speech activity recognition system for a telephony-based application implementing an
audio-visual speech activity estimation algorithm according to one embodiment of the
present invention, wherein the discrete signal spectrum $S_\tau(k \cdot \Delta f)$ of a
delayed version $s(nT-\tau)$ of the analog-to-digital-converted audio signal $s(nT)$ is
correlated (S8a) with an audio speech activity estimate obtained by an amplitude detection
(S8b) of the band-pass-filtered discrete signal spectrum $S(k \cdot \Delta f)$, thereby
yielding an estimate $\hat{S}_i(f)$ for the frequency spectrum $S_i(f)$ corresponding to the
signal $s_i(t)$ which represents said speaker's voice and an estimate $\hat{\Phi}_{nn}(f)$ for
the noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background
noise $n'(t)$, and also correlated (S9) with a visual speech activity estimate taken from a
visual feature vector $\underline{o}_{v,t}$ supplied by the visual feature extraction and
analyzing means 104a+b and/or 104'+104", thereby yielding a further estimate
$\hat{S}_i'(f)$ for updating the estimate $\hat{S}_i(f)$ for the frequency spectrum $S_i(f)$
corresponding to the signal $s_i(t)$ which represents said speaker's voice as well as a
further estimate $\hat{\Phi}_{nn}'(f)$ for updating the estimate $\hat{\Phi}_{nn}(f)$ for the
noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise
$n'(t)$
200c | block diagram showing an example of a fast camera-enhanced noise reduction and
speech activity recognition system for a telephony-based application implementing an
audio-visual speech activity estimation algorithm according to a further embodiment of
the present invention, wherein the discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio signal $s(t)$ is correlated (S11b) with an audio-visual
speech activity estimate, obtained by combining an audio feature vector
$\underline{o}_{a,t}$, which is supplied by said audio feature extraction and analyzing means
106b, with a visual feature vector $\underline{o}_{v,t}$ supplied by the visual speech
activity detection module 104", thereby yielding an estimate $\hat{S}_i(f)$ for the
corresponding frequency spectrum $S_i(f)$ of the signal $s_i(t)$ which represents said
speaker's voice as well as an estimate $\hat{\Phi}_{nn}(f)$ for the noise power density
spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise $n'(t)$
delay element, providing a delayed version of the discrete signal spectrum
$S(k \cdot \Delta f)$ of the analog-to-digital-converted audio signal $s(t)$
204 | band-pass filter with adjustable cut-off frequencies which can be adjusted dependent on
the bandwidth of the estimated speech signal spectrum $\hat{S}_i(f)$, used for filtering the
discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio signal
$s(t)$
300a | flow chart illustrating a near-end speaker detection method reducing the noise level of a
detected analog audio sequence $s(t)$ according to the embodiment depicted in Fig. 1 of
the present invention



300b | flow chart illustrating a near-end speaker detection method according to the embodiment
depicted in Fig. 2b of the present invention



300c | flow chart illustrating a near-end speaker detection method according to the embodiment
depicted in Fig. 2c of the present invention



SW | means for switching said microphone 101a off when the actual level of the speech
activity indication signal $\hat{s}_i(nT)$ falls below a predefined threshold value (not shown)

S1 | step #1: subjecting the analog audio sequence $s(t)$ to an analog-to-digital conversion



S10 | step #10: adjusting the cut-off frequencies of the band-pass filter 204 used for filtering
the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio
signal $s(t)$ dependent on the bandwidth of the estimated speech signal spectrum
$\hat{S}_i(f)$



S11a | step #11a: adding an audio speech activity estimate which is obtained by an amplitude
detection of the band-pass-filtered discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio signal $s(t)$ to a visual speech activity estimate taken
from a visual feature vector $\underline{o}_{v,t}$ supplied by said visual feature extraction
and analyzing means 104a+b and/or 104'+104", thereby yielding an audio-visual speech activity
estimate



S11b | step #11b: correlating the discrete signal spectrum $S(k \cdot \Delta f)$ with the
audio-visual speech activity estimate, thereby yielding an estimate $\hat{S}_i(f)$ for the
frequency spectrum $S_i(f)$ corresponding to the signal $s_i(t)$ which represents said
speaker's voice as well as an estimate $\hat{\Phi}_{nn}(f)$ for said noise power density
spectrum $\Phi_{nn}(f)$



S11c | step #11c: adjusting the cut-off frequencies of a band-pass filter 204 used for filtering
the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio
signal $s(t)$ dependent on the bandwidth of the estimated speech signal spectrum
$\hat{S}_i(f)$
S2 | step #2: calculating the corresponding discrete signal spectrum $S(k \cdot \Delta f)$ of
the analog-to-digital-converted audio sequence $s(nT)$ by performing a Fast Fourier Transform
(FFT)

S3 | step #3: detecting the voice of said speaker $S_i$ from said signal spectrum
$S(k \cdot \Delta f)$ by analyzing visual features extracted from a video sequence recorded
simultaneously with the analog audio sequence $s(t)$, tracking the current location of the
speaker's face, lip movements and/or facial expressions of the speaker $S_i$ in subsequent
images

S4 | step #4: estimating the noise power density spectrum $\Phi_{nn}(f)$ of the statistically
distributed background noise $n'(t)$ based on the result of the speaker detection step S3

S5 | step #5: subtracting a discretized version $\hat{\Phi}_{nn}(k \cdot \Delta f)$ of the
estimated noise power density spectrum $\hat{\Phi}_{nn}(f)$ from the discrete signal spectrum
$S(k \cdot \Delta f)$ of the analog-to-digital-converted audio sequence $s(nT)$

S6 | step #6: calculating the corresponding discrete time-domain signal $\hat{s}_i(nT)$ of the
obtained difference signal by performing an Inverse Fast Fourier Transform (IFFT)

S7 | step #7: conducting a multi-channel acoustic echo cancellation algorithm which models
echo path impulse responses by means of adaptive finite impulse response (FIR) filters
and subtracts echo signals from the analog audio sequence $s(t)$ based on acoustic-phonetic
speech characteristics derived by an algorithm for extracting visual features from a video
sequence tracking the location of a speaker's face, lip movements and/or facial expressions
of the speaker $S_i$ in subsequent images

S8o | step #8o: band-pass-filtering the discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio signal $s(nT)$

S8a | step #8a: correlating the discrete signal spectrum $S_\tau(k \cdot \Delta f)$ of a
delayed version $s(nT-\tau)$ of the analog-to-digital-converted audio signal $s(nT)$ with an
audio speech activity estimate obtained by the amplitude detection step S8b

S8b | step #8b: amplitude detection of the band-pass-filtered discrete signal spectrum
$S(k \cdot \Delta f)$, thereby yielding an estimate $\hat{S}_i(f)$ for the frequency spectrum
$S_i(f)$ corresponding to the signal $s_i(t)$ which represents said speaker's voice as well as
an estimate $\hat{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$
S9 | step #9: correlating the discrete signal spectrum $S_\tau(k \cdot \Delta f)$ of a delayed
version $s(nT-\tau)$ of the analog-to-digital-converted audio signal $s(nT)$ with a visual
speech activity estimate taken from a visual feature vector $\underline{o}_{v,t}$ supplied by
the visual feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding a
further estimate $\hat{S}_i'(f)$ for updating the estimate $\hat{S}_i(f)$ for the frequency
spectrum $S_i(f)$ corresponding to the signal $s_i(t)$ which represents said speaker's voice
as well as a further estimate $\hat{\Phi}_{nn}'(f)$ for updating the estimate
$\hat{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of the statistically
distributed background noise $n'(t)$


S10 | step #10: adjusting the cut-off frequencies of a band-pass filter 204 used for filtering
the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio
signal $s(t)$ dependent on the bandwidth of the estimated speech signal spectrum
$\hat{S}_i(f)$


S11a | step #11a: adding an audio speech activity estimate obtained by an amplitude detection
of the band-pass-filtered discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio signal $s(t)$ to a visual speech activity estimate taken
from a visual feature vector $\underline{o}_{v,t}$ supplied by said visual feature extraction
and analyzing means 104a+b and/or 104'+104", thereby yielding an audio-visual speech activity
estimate


S11b | step #11b: correlating the discrete signal spectrum $S(k \cdot \Delta f)$ with the
audio-visual speech activity estimate, thus yielding an estimate $\hat{S}_i(f)$ for the
frequency spectrum $S_i(f)$ corresponding to the signal $s_i(t)$ which represents said
speaker's voice as well as an estimate $\hat{\Phi}_{nn}(f)$ for the noise power density
spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise $n'(t)$


S11c | step #11c: adjusting the cut-off frequencies of a band-pass filter 204 used for filtering
the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio
signal $s(t)$ dependent on the bandwidth of the estimated speech signal spectrum
$\hat{S}_i(f)$
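
The procedure steps S1 to S6 listed in the table can be illustrated by the following minimal
sketch. It is written under several assumptions that are not taken from the present
document: the audio sequence is already digitized and split into equally long frames, the
result of the visual speaker detection step S3 is available as one boolean per frame, a
naive discrete Fourier transform stands in for the FFT/IFFT, and magnitudes rather than
power densities are subtracted. All function and parameter names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stand-in for the FFT of step S2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse transform (stand-in for the IFFT of step S6)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def reduce_noise(frames, speaker_active, floor=0.0):
    """Spectral subtraction gated by a per-frame visual speaker decision.

    frames         -- list of equally long sample lists s(nT) (step S1 assumed done)
    speaker_active -- list of booleans, hypothetical output of step S3
    floor          -- smallest magnitude kept after subtraction (assumption)
    """
    n = len(frames[0])
    noise_mag = [0.0] * n   # running mean of the noise magnitude per frequency bin
    noise_frames = 0
    cleaned = []
    for frame, active in zip(frames, speaker_active):
        spectrum = dft(frame)                              # step S2
        if not active:                                     # step S4: speaker is silent,
            noise_frames += 1                              # so the frame is pure noise
            for k in range(n):
                noise_mag[k] += (abs(spectrum[k]) - noise_mag[k]) / noise_frames
        difference = []
        for k in range(n):                                 # step S5: subtract the noise
            mag = max(abs(spectrum[k]) - noise_mag[k], floor)
            difference.append(cmath.rect(mag, cmath.phase(spectrum[k])))
        cleaned.append(idft(difference))                   # step S6
    return cleaned
```

In this sketch the noise estimate is only updated while the speaker is visually detected as
silent, which mirrors the dependence of step S4 on the result of step S3.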
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Claims OZOkt. 2003 


1. A noise reduction system with an audio-visual user interface, said system being specially
adapted for running an application for combining visual features ($\underline{o}_{v,nT}$)
extracted from a digital video sequence ($v(nT)$) showing the face of a speaker ($S_i$) with
audio features ($\underline{o}_{a,nT}$) extracted from an analog audio sequence ($s(t)$),
wherein said audio sequence ($s(t)$) can include noise in the environment of said speaker
($S_i$), said noise reduction system (200b/c) comprising

- means (101a, 106b) for detecting and analyzing said analog audio sequence ($s(t)$),

- means (101b) for detecting said video sequence ($v(nT)$), and

- means (104a+b, 104'+104") for analyzing the detected video signal ($v(nT)$),
characterized by 

a noise reduction circuit (106) being adapted to separate the speaker's voice from said
background noise ($n'(t)$) based on a combination of derived speech characteristics
($\underline{o}_{av,nT} := [\underline{o}_{a,nT}^T, \underline{o}_{v,nT}^T]^T$) and outputting
a speech activity indication signal ($\hat{s}_i(nT)$) which is obtained by a combination of
speech activity estimates supplied by said analyzing means (106b, 104a+b, 104'+104").

2. A noise reduction system according to claim 1, 
characterized by 

means (SW) for switching off an audio channel in case the actual level of said speech
activity indication signal ($\hat{s}_i(nT)$) falls below a predefined threshold value.

3. A noise reduction system according to anyone of the claims 1 or 2,
characterized by

a multi-channel acoustic echo cancellation unit (108) being specially adapted to perform a
near-end speaker detection and double-talk detection algorithm based on acoustic-phonetic
speech characteristics derived by said audio feature extraction and analyzing means (106b)
and said visual feature extraction and analyzing means (104a+b, 104'+104").






4. A noise reduction system according to anyone of the claims 1 to 3, 
characterized in that 

said audio feature extraction and analyzing means (106b) is an amplitude detector. 

5. A near-end speaker detection method reducing the noise level of a detected analog audio
sequence ($s(t)$),

said method being characterized by the following steps:

- subjecting (S1) said analog audio sequence ($s(t)$) to an analog-to-digital conversion,

- calculating (S2) the corresponding discrete signal spectrum ($S(k \cdot \Delta f)$) of the
analog-to-digital-converted audio sequence ($s(nT)$) by performing a Fast Fourier Transform
(FFT),

- detecting (S3) the voice of said speaker ($S_i$) from said signal spectrum
($S(k \cdot \Delta f)$) by analyzing visual features ($\underline{o}_{v,nT}$) extracted from a
video sequence ($v(nT)$) recorded simultaneously with the analog audio sequence ($s(t)$),
tracking the current location of the speaker's face, lip movements and/or facial expressions
of the speaker ($S_i$) in subsequent images,

- estimating (S4) the noise power density spectrum ($\Phi_{nn}(f)$) of the statistically
distributed background noise ($n'(t)$) based on the result of the speaker detection step (S3),

- subtracting (S5) a discretized version ($\hat{\Phi}_{nn}(k \cdot \Delta f)$) of the estimated
noise power density spectrum ($\hat{\Phi}_{nn}(f)$) from the discrete signal spectrum
($S(k \cdot \Delta f)$) of the analog-to-digital-converted audio sequence ($s(nT)$), and

- calculating (S6) the corresponding discrete time-domain signal ($\hat{s}_i(nT)$) of the
obtained difference signal by performing an Inverse Fast Fourier Transform (IFFT), thereby
yielding a discrete version of the recognized speech signal.

6. A near-end speaker detection method according to claim 5, 
characterized by the step of 

conducting (S7) a multi-channel acoustic echo cancellation algorithm which models echo
path impulse responses by means of adaptive finite impulse response (FIR) filters and
subtracts echo signals from the analog audio sequence ($s(t)$) based on acoustic-phonetic
speech characteristics derived by an algorithm for extracting visual features
($\underline{o}_{v,nT}$) from a video sequence ($v(nT)$) tracking the location of a speaker's
face, lip movements and/or facial expressions of the speaker ($S_i$) in subsequent images.

7. A near-end speaker detection method according to claim 6,
characterized in that

said multi-channel acoustic echo cancellation algorithm performs a double-talk detection
procedure.

8. A near-end speaker detection method according to anyone of the claims 5 to 7,
characterized in that

said acoustic-phonetic speech characteristics are based on the opening of a speaker's mouth
as an estimate of the acoustic energy of articulated vowels or ..., rapid movement of the
speaker's lips as a hint to labial or labio-dental consonants ... position and movement of the
lips and the voice and pronunciation of said speaker ...

9. ... a learning procedure used for enhancing the step of detecting (S3) the voice of said
speaker ($S_i$) from the discrete signal spectrum ($S(k \cdot \Delta f)$) of an analog audio
sequence ($s(t)$) by analyzing visual features ($\underline{o}_{v,nT}$) extracted from a video
sequence ($v(nT)$) recorded simultaneously with the analog audio sequence ($s(t)$), tracking
the current location of the speaker's face, lip movements and/or facial expressions of the
speaker ($S_i$) in subsequent images.

10. A near-end speaker detection method according to anyone of the claims 5 to 9,
characterized by the step of

correlating (S8a) the discrete signal spectrum ($S_\tau(k \cdot \Delta f)$) of a delayed
version ($s(nT-\tau)$) of the analog-to-digital-converted audio signal ($s(nT)$) with an audio
speech activity estimate obtained by an amplitude detection (S8b) of the band-pass-filtered
discrete signal spectrum ($S(k \cdot \Delta f)$), thereby yielding an estimate
($\hat{S}_i(f)$) for the frequency spectrum ($S_i(f)$) corresponding to the signal ($s_i(t)$)
which represents said speaker's voice as well as an estimate ($\hat{\Phi}_{nn}(f)$) for the
noise power density spectrum ($\Phi_{nn}(f)$) of the statistically distributed background
noise ($n'(t)$).

11. A near-end speaker detection method according to claim 10,
characterized by the step of

correlating (S9) the discrete signal spectrum ($S_\tau(k \cdot \Delta f)$) of a delayed
version ($s(nT-\tau)$) of the analog-to-digital-converted audio signal ($s(nT)$) with a visual
speech activity estimate taken from a visual feature vector ($\underline{o}_{v,t}$) supplied
by the visual feature extraction and analyzing means (104a+b, 104'+104"), thereby yielding a
further estimate ($\hat{S}_i'(f)$) for updating the estimate ($\hat{S}_i(f)$) for the
frequency spectrum ($S_i(f)$) corresponding to the signal ($s_i(t)$) which represents said
speaker's voice as well as a further estimate ($\hat{\Phi}_{nn}'(f)$) for updating the
estimate ($\hat{\Phi}_{nn}(f)$) for the noise power density spectrum ($\Phi_{nn}(f)$) of the
statistically distributed background noise ($n'(t)$).

12. A near-end speaker detection method according to anyone of the claims 10 or 11,
characterized by the step of

adjusting (S10) the cut-off frequencies of a band-pass filter (204) used for filtering the
discrete signal spectrum ($S(k \cdot \Delta f)$) of the analog-to-digital-converted audio
signal ($s(t)$) dependent on the bandwidth of the estimated speech signal spectrum
($\hat{S}_i(f)$).



13. A near-end speaker detection method according to anyone of the claims 5 to 9,
characterized by the steps of

- adding (S11a) an audio speech activity estimate obtained by an amplitude detection of
the band-pass-filtered discrete signal spectrum ($S(k \cdot \Delta f)$) of the
analog-to-digital-converted audio signal ($s(t)$) to a visual speech activity estimate taken
from a visual feature vector ($\underline{o}_{v,t}$) supplied by said visual feature
extraction and analyzing means (104a+b, 104'+104"), thereby yielding an audio-visual speech
activity estimate,

- correlating (S11b) the discrete signal spectrum ($S(k \cdot \Delta f)$) with the
audio-visual speech activity estimate, thereby yielding an estimate ($\hat{S}_i(f)$) for the
frequency spectrum ($S_i(f)$) corresponding to the signal ($s_i(t)$) which represents said
speaker's voice as well as an estimate ($\hat{\Phi}_{nn}(f)$) for the noise power density
spectrum ($\Phi_{nn}(f)$) of the statistically distributed background noise ($n'(t)$), and

- adjusting (S11c) the cut-off frequencies of a band-pass filter (204) used for filtering the
discrete signal spectrum ($S(k \cdot \Delta f)$) of the analog-to-digital-converted audio
signal ($s(t)$) dependent on the bandwidth of the estimated speech signal spectrum
($\hat{S}_i(f)$).

14. Use of a noise reduction system (200b/c) according to anyone of the claims 1 to 4 and a
near-end speaker detection method according to anyone of the claims 5 to 13 for a
video-telephony based application in a telecommunication system running on a video-enabled
phone with a built-in video camera (101b) pointing at the face of a speaker ($S_i$)
participating in a video telephony session.

15. A telecommunication device equipped with an audio-visual user interface 
characterized by 

a noise reduction system (200b/c) according to anyone of the claims 1 to 4.
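
The steps S11a to S11c named in claim 13 can be illustrated by the following minimal
sketch. It rests on assumptions that are not part of the claims: both speech activity
estimates are scalars in the range 0 to 1, their sum is halved so that the combined estimate
stays in that range, "correlating" is read as a bin-wise multiplication (as suggested by the
multiplier elements), and the bandwidth is taken from the bins that reach a relative
magnitude threshold. All names and default values are hypothetical.

```python
def audio_visual_activity(audio_estimate, visual_estimate):
    """Step S11a: add the audio and the visual speech activity estimates.

    The sum is halved and clipped to [0, 1]; this normalisation is an
    illustrative assumption.
    """
    return min(1.0, max(0.0, 0.5 * (audio_estimate + visual_estimate)))


def split_spectrum(spectrum, activity):
    """Step S11b: weight the spectrum magnitudes with the activity estimate.

    Returns (speech_estimate, noise_estimate) per frequency bin; the share
    (1 - activity) is attributed to the background noise.
    """
    speech = [activity * abs(x) for x in spectrum]
    noise = [(1.0 - activity) * abs(x) for x in spectrum]
    return speech, noise


def band_pass_cutoffs(speech, delta_f, rel_threshold=0.1):
    """Step S11c: derive cut-off frequencies from the estimated speech bandwidth.

    Bins whose magnitude reaches rel_threshold times the peak count as occupied;
    the lowest and highest occupied bins define the pass band.
    """
    peak = max(speech)
    if peak <= 0.0:
        return None  # no speech energy found, leave the filter unchanged
    occupied = [k for k, mag in enumerate(speech) if mag >= rel_threshold * peak]
    return occupied[0] * delta_f, occupied[-1] * delta_f
```

A caller would feed the returned cut-off frequencies back into the band-pass filter before
the next amplitude detection, which closes the loop between steps S11c and S11a.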

Abstract 



The present invention generally relates to the field of noise reduction systems which are
equipped with an audio-visual user interface, in particular to an audio-visual speech
activity recognition system (200b/c) of a video-enabled telecommunication device which runs a
real-time lip tracking application that can advantageously be used for a near-speaker
detection algorithm in an environment where a speaker's voice is interfered by a statistically
distributed background noise ($n'(t)$) including both environmental noise ($n(t)$) and
surrounding persons' voices ($\sum_j a_j \cdot s_j(t - T_j)$ with $j \neq i$). Said real-time
lip tracking application combines a visual feature vector ($\underline{o}_{v,nT}$) that
comprises features extracted from a digital video sequence ($v(nT)$) showing the speaker's
face by detecting and analyzing lip movements and facial expressions of said speaker ($S_i$)
with an audio feature vector ($\underline{o}_{a,nT}$) which comprises features extracted from
a recorded analog audio sequence ($s(t)$) representing the voice of said speaker ($S_i$)
interfered by said background noise ($n'(t)$).

According to one embodiment, the noise reduction system (200b/c) comprises audio feature
extraction and analyzing means (106b) for deriving acoustic-phonetic speech characteristics
of the speaker's voice and pronunciation from said audio sequence ($s(t)$) as well as
visual feature extraction and analyzing means (104a+b, 104'+104") for detecting the current
location of the speaker's face and tracking lip movements and facial expressions of the
speaker ($S_i$) in subsequent images. A noise reduction circuit (106), which is adapted to
perform a near-speaker detection by separating the speaker's voice from said background
noise ($n'(t)$) based on a combination of derived speech characteristics
($\underline{o}_{av,nT} := [\underline{o}_{a,nT}^T, \underline{o}_{v,nT}^T]^T$),
outputs a speech activity indication signal ($\hat{s}_i(nT)$) obtained by a combination of
speech activity estimates supplied by said audio feature extraction and analyzing means (106b)
as well as said visual feature extraction and analyzing means (104a+b, 104'+104"). Moreover,
means (SW) are provided for switching said microphone (101a) off when the actual
level of the speech activity indication signal ($\hat{s}_i(nT)$) falls below a predefined
threshold value.

(Figs. 2b+c) 
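
The behaviour of the switching means (SW) described in the abstract can be illustrated by
the following minimal sketch. It assumes, beyond what the present document states, that the
speech activity indication signal is available as one value per audio sample and that
switching the microphone off is modelled by replacing the sample with zero. The function
name and the default threshold are hypothetical.

```python
def gate_audio(samples, activity_signal, threshold=0.5):
    """Mute every sample whose speech activity indication is below the threshold."""
    return [s if a >= threshold else 0.0 for s, a in zip(samples, activity_signal)]
```

For example, `gate_audio([0.2, -0.4, 0.1, 0.3], [0.9, 0.7, 0.2, 0.5])` keeps all samples
except the third one, whose indication value 0.2 lies below the threshold.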
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